+++
title = 'Programming reference'
+++
# Numpy & matplotlib
Load external file:
```python
import numpy as np
data = np.loadtxt('./filepath.csv', delimiter=',')
```

Print the dimensions of the data (rows, columns):

```python
data.shape
```

Graph two columns of data:

```python
import matplotlib.pyplot as plt
# IPython/Jupyter magic: only valid inside a notebook
%matplotlib inline
x = data[:,0]
y = data[:,1]
# s sets marker size, alpha sets transparency, c uses the third column for color
plt.scatter(x, y, s=3, alpha=0.2, c=data[:,2], cmap='RdYlBu_r')
plt.xlabel('x axis')
plt.ylabel('y axis');
```

Histogram plotting:

```python
# bins sets the number of bars (more bins means narrower bars); range sets the interval covered
plt.hist(data, bins=100, range=[start, end])
```

The identity matrix:

```python
np.eye(2) # for a 2x2 matrix
```

Matrix multiplication:

```python
a * b       # element-wise product
a.dot(b)    # matrix product (dot product for 1-D arrays); same as a @ b
```
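
To make the difference between the two products concrete, here is a minimal sketch on two small matrices. The arrays `a` and `b` are made-up values chosen for illustration.

```python
import numpy as np

# Hypothetical 2x2 matrices, chosen only for illustration
a = np.array([[1, 2],
              [3, 4]])
b = np.array([[0, 1],
              [1, 0]])

elementwise = a * b               # multiplies entries position by position
matrix_prod = a.dot(b)            # rows of a combined with columns of b
identity_prod = a.dot(np.eye(2))  # multiplying by the identity gives back a (as floats)

print(elementwise)
print(matrix_prod)
```

Here `b` swaps the two columns of `a` under the matrix product, while the element-wise product just zeroes out the diagonal.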

Useful references:
* [The official numpy quickstart guide](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html)
* [A more in-depth tutorial, with in-browser samples](https://www.datacamp.com/community/tutorials/python-numpy-tutorial)
* [A very good walk through the most important functions and features](http://cs231n.github.io/python-numpy-tutorial/). From the famous [CS231n course](http://cs231n.github.io/), from Stanford.
* [The official pyplot tutorial](https://matplotlib.org/users/pyplot_tutorial.html). Note that pyplot can accept basic Python lists as well as numpy data.
* [A gallery of example MPL plots](https://matplotlib.org/gallery.html). Most of these do not use the pyplot state-machine interface, but lower-level objects like [Axes](https://matplotlib.org/api/axes_api.html).
* [In-depth walk through the main features and plot types](http://www.scipy-lectures.org/intro/matplotlib/matplotlib.html)


# Sklearn
Split data into train and test, on features `x` and target `y`:

```python
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
```

An estimator implements a method `fit(x, y)` that learns from data, and a method `predict(T)` that takes new instances and predicts their target values.
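
As a rough illustration of that interface, the sketch below defines a toy estimator using only numpy. The class name `CentroidClassifier` and the toy data are hypothetical and not part of sklearn; the point is only the `fit`/`predict` shape that the sklearn models below share.

```python
import numpy as np

class CentroidClassifier:
    """Toy estimator (hypothetical): predicts the class whose mean point is closest."""

    def fit(self, x, y):
        x, y = np.asarray(x, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # One centroid (mean feature vector) per class
        self.centroids_ = np.array([x[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, T):
        T = np.asarray(T, dtype=float)
        # Distances from every instance to every centroid, shape (n_instances, n_classes)
        dists = np.linalg.norm(T[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]

# Made-up training data: two points per class
x_toy = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_toy = np.array([0, 0, 1, 1])

clf = CentroidClassifier().fit(x_toy, y_toy)
pred = clf.predict([[0.1, 0.0], [1.0, 0.9]])
print(pred)
```

Any object with these two methods can be trained and queried in the same way, which is why the classifiers below are interchangeable in the later evaluation code.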

Linear classifier, using SVC model with linear kernel:

```python
from sklearn.svm import SVC
linear = SVC(kernel='linear')
linear.fit(x_train, y_train)
```

Decision tree classifier:

```python
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(x_train, y_train)
```

k-Nearest Neighbors:

```python
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(15) # We set the number of neighbors to 15
knn.fit(x_train, y_train)
```

Try to classify new data:

```python
linear.predict(some_data)
```

Compute accuracy on testing data:

```python
from sklearn.metrics import accuracy_score
y_predicted = linear.predict(x_test)
accuracy_score(y_test, y_predicted)
```
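
Accuracy is simply the fraction of predictions that match the true labels. A minimal sketch of the same computation by hand, using made-up label arrays (`y_true_toy` and `y_pred_toy` are hypothetical):

```python
import numpy as np

# Made-up labels: 4 of the 5 predictions are correct
y_true_toy = np.array([0, 1, 1, 0, 1])
y_pred_toy = np.array([0, 1, 0, 0, 1])

# Comparing the arrays gives booleans; their mean is the fraction of matches
accuracy = np.mean(y_true_toy == y_pred_toy)
print(accuracy)
```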

Make a plot of classification, with colors showing classifier's decision:

```python
from mlxtend.plotting import plot_decision_regions
# mlxtend expects integer class labels, hence the cast
plot_decision_regions(x_test[:500], y_test.astype(int)[:500], clf=linear, res=0.1);
```

Compare classifiers via ROC curve:

```python
from sklearn.metrics import roc_curve, auc

# The linear classifier doesn't produce class probabilities by default. We'll retrain it for probabilities.
linear = SVC(kernel='linear', probability=True)
linear.fit(x_train, y_train)

# We'll need class probabilities from each of the classifiers
y_linear = linear.predict_proba(x_test)
y_tree  = tree.predict_proba(x_test)
y_knn   = knn.predict_proba(x_test)

# Compute the points on the curve: roc_curve returns (fpr, tpr, thresholds)
# We pass the probability of the second class (KIA) as the y_score
curve_linear = roc_curve(y_test, y_linear[:, 1])
curve_tree   = roc_curve(y_test, y_tree[:, 1])
curve_knn    = roc_curve(y_test, y_knn[:, 1])

# Compute Area Under the Curve
auc_linear = auc(curve_linear[0], curve_linear[1])
auc_tree   = auc(curve_tree[0], curve_tree[1])
auc_knn    = auc(curve_knn[0], curve_knn[1])

plt.plot(curve_linear[0], curve_linear[1], label='linear (area = %0.2f)' % auc_linear)
plt.plot(curve_tree[0], curve_tree[1], label='tree (area = %0.2f)' % auc_tree)
plt.plot(curve_knn[0], curve_knn[1], label='knn (area = %0.2f)' % auc_knn)

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC curve');

plt.legend();
```
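
The `auc` function integrates the curve with the trapezoidal rule. The sketch below reproduces that calculation with numpy on a made-up curve; the helper name `trapezoid_auc` and the toy points are assumptions for illustration, not sklearn code.

```python
import numpy as np

def trapezoid_auc(fpr, tpr):
    """Area under a curve given x values (fpr) and y values (tpr), by the trapezoidal rule."""
    fpr, tpr = np.asarray(fpr, dtype=float), np.asarray(tpr, dtype=float)
    widths = np.diff(fpr)               # horizontal step between consecutive points
    heights = (tpr[1:] + tpr[:-1]) / 2  # average height over each step
    return float(np.sum(widths * heights))

# Made-up ROC points, ordered by increasing false positive rate
fpr_toy = [0.0, 0.0, 0.5, 1.0]
tpr_toy = [0.0, 0.5, 1.0, 1.0]

area = trapezoid_auc(fpr_toy, tpr_toy)
print(area)
```

A classifier that guesses at random follows the diagonal from (0, 0) to (1, 1), which gives an area of 0.5; a perfect classifier gives 1.0.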

Cross-validation:

```python
from sklearn.model_selection import cross_val_score

# The cross_val_score function does all the training for us. We simply pass
# it the complete data, the model, and the metric.

linear = SVC(kernel='linear', probability=True)

# Train for 3 folds (cv=3), returning ROC AUC. You can also try 'accuracy' as a scorer
scores = cross_val_score(linear, x, y, cv=3, scoring='roc_auc')

print('scores per fold ', scores)
```
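
To show what "folds" means, here is a minimal sketch of plain k-fold splitting with numpy. The helper `k_fold_indices` is hypothetical; with an integer `cv` and a classifier, `cross_val_score` additionally stratifies the folds by class, which this sketch does not do.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    indices = np.arange(n_samples)
    folds = np.array_split(indices, k)  # k nearly equal, consecutive chunks
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(10, 3))
for train_idx, test_idx in splits:
    print('train', train_idx, 'test', test_idx)
```

The model is trained once per pair on `train_idx` and scored on `test_idx`, which is where the one-score-per-fold array comes from.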

Regression:

```python
from sklearn import datasets
from sklearn.metrics import mean_squared_error, r2_score

# Load the diabetes dataset, and select one feature (Body Mass Index)
x, y = datasets.load_diabetes(return_X_y=True)
x = x[:, 2].reshape(-1, 1)

# -- the reshape operation ensures that x still has two dimensions
# (that is, we need it to be an n by 1 matrix, not a vector)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

# feature space on horizontal axis, output space on vertical axis
plt.scatter(x_train[:, 0], y_train)
plt.xlabel('BMI')
plt.ylabel('disease progression');

# Train three models: linear regression, tree regression, knn regression
from sklearn.linear_model import LinearRegression
linear = LinearRegression()
linear.fit(x_train, y_train)

from sklearn.tree import DecisionTreeRegressor
tree = DecisionTreeRegressor()
tree.fit(x_train, y_train)

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(10)
knn.fit(x_train, y_train);

# Plot the models
plt.scatter(x_train, y_train, alpha=0.1)

# Evenly spaced inputs on which to draw each model's prediction curve
xlin = np.linspace(-0.10, 0.2, 500).reshape(-1, 1)
plt.plot(xlin, linear.predict(xlin), label='linear')
plt.plot(xlin, tree.predict(xlin), label='tree')
plt.plot(xlin, knn.predict(xlin), label='knn')

print('MSE linear ', mean_squared_error(y_test, linear.predict(x_test)))
print('MSE tree ', mean_squared_error(y_test, tree.predict(x_test)))
print('MSE knn ', mean_squared_error(y_test, knn.predict(x_test)))

plt.legend();
```
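
The two regression metrics imported above are short formulas: MSE is the mean of the squared residuals, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch by hand, on made-up targets and predictions (`y_true_toy` and `y_pred_toy` are hypothetical):

```python
import numpy as np

# Made-up targets and predictions
y_true_toy = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_toy = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true_toy - y_pred_toy

# Mean squared error: average of the squared residuals
mse = np.mean(residuals ** 2)

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, r2)
```

Lower MSE is better, while R² is at most 1, and a model that always predicts the mean of the targets scores 0.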

Useful references:
* [The official quickstart guide](http://scikit-learn.org/stable/tutorial/basic/tutorial.html)
* [A DataCamp tutorial with interactive exercises](https://www.datacamp.com/community/tutorials/machine-learning-python)
* [Analyzing text data with SKLearn](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)